Some Basics of Network Analysis

Network analysis is all about connections (surprise). What sets network analytic approaches apart from ‘standard’ quantitative research is the inherent interdependence of the data: observations only exist BECAUSE they are connected. For example, multinational trade relations require at least two countries to trade goods. Otherwise, we cannot observe the phenomenon.

In network terms, each country in the trade example would be called a node (or vertex, pl. vertices) and the trade through which they are connected would be an edge (or tie). Nodes and edges can have additional features, as we will see below.

In network analysis, researchers can be interested in the position of specific nodes or in the overall network structure. There are several quantitative measures you can calculate for this, some of which are introduced below. The network measures can then be either explanatory or dependent variables, depending on your research question.

Network Analysis in R

There are several packages for network analysis in R:

  • igraph (most popular, syntax rather cryptic, incl. plotting functions)
  • network (largely overlaps with igraph, syntax is more straight-forward, part of the ‘statnet family’)
  • tidygraph (builds on igraph and includes most of its functionality, tidy syntax, no plotting functions)
  • ggraph (for plotting, syntax like ggplot, mostly used together with tidygraph)
  • visNetwork, threejs, networkD3, ndtv-d3 (interactive plotting, order ascending in complexity)

This tutorial will only feature tidygraph, ggraph, and visNetwork. See the Resources section for more great material, including material on the other packages.

Contents of Networks & Basic Graphs

That networks consist of two types of data (nodes and edges) is also visible in network objects in R. In tidygraph, these objects are named tbl_graph. Let’s construct one of those with some example node and edge data.

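# packages assumed to be loaded for the code in this tutorial
library(tidyverse) # tibble(), dplyr verbs, %>%
library(tidygraph)
library(ggraph)

node_list <- tibble(id = c(1:5))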

edge_list <- tibble(from = c(1,1,1,2,3,3,3,4,5,5,5), to = c(2,2,3,4,2,4,5,5,2,2,2)) %>%
  group_by(from, to) %>%
  summarise(weight = n())

undir_net <- tbl_graph(nodes = node_list, edges = edge_list, directed = FALSE, node_key = "id")
undir_net
## # A tbl_graph: 5 nodes and 8 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 5 x 1 (active)
##      id
##   <int>
## 1     1
## 2     2
## 3     3
## 4     4
## 5     5
## #
## # Edge Data: 8 x 3
##    from    to weight
##   <int> <int>  <int>
## 1     1     2      2
## 2     1     3      1
## 3     2     4      1
## # ... with 5 more rows

The minimum content of node data is an identifier variable (here id). The identifier has to appear in the edge data as well. There it indicates which nodes are connected to one another (here from and to). I already added a weight to the edges. This is simply a count of how many connections exist between a pair of nodes. The resulting graph is undirected (directed = FALSE), which can be read as a ‘mutual relationship’.

We can now plot this object with ggraph, which basically works like ggplot2 with some network-specific features. The layout option defines the algorithm that positions the nodes and edges relative to one another. The available layouts include those of the igraph functions layout_with_*. ‘kk’ stands for Kamada-Kawai, a very common force-directed layout algorithm that tries to make the distance between two nodes on the plot reflect their path distance in the graph.

pal <- viridis::viridis(n = 6, begin = .3, end = .75) # define color palette for plotting

# undirected, unweighted, no attributes  
ggraph(undir_net, layout = "kk") + 
  geom_edge_link() + 
  geom_node_point(size = 10, colour = pal[1]) +
  geom_node_text(aes(label = id), 
                 colour = "white", vjust = 0.4) + 
  theme_graph() # theme without axes, gridlines etc.

Undirected, unweighted graph without attributes

Setting directed = TRUE makes the edges directed: each one now points from its from node to its to node. Some real-world examples are cash or trade flows, social media communication, or non-mutual friendship 😢

dir_net <- tbl_graph(nodes = node_list, edges = edge_list, directed = TRUE, node_key = "id")

ggraph(dir_net, layout = "kk") + 
  geom_edge_link(arrow = arrow(angle = 15, type = "closed", length = unit(4, "mm")), 
                 end_cap = circle(4, "mm")) + # so arrows don't overlap nodes 
  geom_node_point(size = 10, colour = pal[1]) + 
  geom_node_text(aes(label = id), 
                 colour = "white", vjust = 0.4) + 
  theme_graph()

Directed, unweighted graph without attributes

We can also visualise the weight of the edges to show how ‘strong’ certain ties are, by mapping weight to the width aesthetic.

# undirected, weighted, no attributes
ggraph(undir_net, layout = "kk") + 
  geom_edge_link(aes(width = weight), 
                 alpha = 0.7) + 
  geom_node_point(size = 10, colour = pal[1]) +
  geom_node_text(aes(label = id), 
                 colour = "white", vjust = 0.4) +
  theme_graph(base_family="sans") # provide font family, otherwise can't render document (Windows)

Undirected, weighted graph without attributes

We are often interested in certain attributes of nodes or edges. To add some to the example data, tidygraph needs to know which data we want to alter (nodes or edges). Therefore, the package contains the activate function. Then, we can manipulate data with all the dplyr verbs we know and (mostly) love.

# undirected, unweighted, with attributes
undir_net.att <- undir_net %>%
  activate(nodes) %>%
  mutate(Preference = rep(c("Python", "R"), c(3, 2))) %>%
  activate(edges) %>% 
  mutate(Relationship = sample(c("Friends", "Foes"), 8, replace = TRUE))

ggraph(undir_net.att, layout = "kk") + 
  geom_edge_link(aes(label = Relationship), 
                 angle_calc = "along", label_dodge = unit(2.5, "mm"), label_push = unit(10, "mm"),
                 alpha = 0.7) + 
  geom_node_point(aes(colour = Preference),
                  size = 10) +
  scale_color_manual(values = pal[c(6,1)]) +
  geom_node_text(aes(label = id), 
                 colour = "white", vjust = 0.4) + 
  theme_graph(base_family="sans") # provide font family, otherwise can't render document (Windows)

Undirected, unweighted graph with attributes

The concept of attributes can be pushed further: we can say that nodes are of different types. This results in bipartite (or two-mode) networks, where nodes of the same type are not directly connected to one another, but only through the nodes of the other type. This could be employees in firms or authors of research papers, for example.

bipart_net <- play_bipartite(8, 2, p=0.8, directed = FALSE) %>% # play_* generates different types of networks
  activate(nodes) %>%
  mutate(Node.type = if_else(type, "Firm", "Employee")) # type is the logical node-type flag

ggraph(bipart_net, layout = "stress") +
  geom_edge_link() +
  geom_node_point(aes(shape = Node.type , color = Node.type), 
                  size = 6) + 
  scale_color_manual(values = pal[c(1,6)]) +
  theme_graph(base_family="sans") # provide font family, otherwise can't render document (Windows)

Bi-partite graph

Network Data in the Wild: The Matrix Trilogy

As described above, we need a node set and an edge set to do network analysis in R. However, life out there seldom provides us with data in this specific format. This is unlike, e.g., survey data, which usually comes in a ready-to-use format (apart from some variable recoding etc.).

Instead, networks are mostly stored as different kinds of matrices. How these relate to one another can be confusing, and it is something we don’t usually have to deal with in standard quant research. Three types of matrices can describe a network, and ‘translating’ them into the desired format varies by type. Often your data isn’t even a matrix yet. In that case, you first have to figure out which of the following formats you can or should transform it into.

Adjacency Matrix (aka. Sociomatrix)

An adjacency matrix is basically a cross-table of the same elements (mostly of the nodes), and is therefore square. The cell values can be restricted to 0/1 to indicate whether there are any connections between the elements, or be a count of the connections.

adj_mat <- matrix(sample(0:3, 16, replace = TRUE), nrow = 4)

colnames(adj_mat) <- rownames(adj_mat) <- LETTERS[1:4]

adj_mat
##   A B C D
## A 3 1 2 1
## B 0 2 3 0
## C 3 1 3 2
## D 1 3 3 2

tidygraph can create tbl_graph objects from a variety of data formats, including adjacency matrices.

adj_to_net <- as_tbl_graph(adj_mat, directed = FALSE)
adj_to_net
## # A tbl_graph: 4 nodes and 10 edges
## #
## # An undirected multigraph with 1 component
## #
## # Node Data: 4 x 1 (active)
##   name 
##   <chr>
## 1 A    
## 2 B    
## 3 C    
## 4 D    
## #
## # Edge Data: 10 x 3
##    from    to weight
##   <int> <int>  <dbl>
## 1     1     1      3
## 2     1     2      1
## 3     1     3      3
## # ... with 7 more rows
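
To see where the 10 edges come from, here is a minimal base R sketch. It retypes the matrix printed above and assumes that, for directed = FALSE, the larger of the two mirrored cells is kept as the weight of a tie. That rule is my reading of the printed weights, not something the output states. The object names (adj_counts, adj_sym, adj_binary) are only illustrative.

```r
# Retype the count matrix printed above (values copied from the output).
adj_counts <- matrix(
  c(3, 1, 2, 1,
    0, 2, 3, 0,
    3, 1, 3, 2,
    1, 3, 3, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(LETTERS[1:4], LETTERS[1:4])
)

# Assumed rule for the undirected case: keep the larger of cell [i, j] and cell [j, i].
adj_sym <- pmax(adj_counts, t(adj_counts))

# Each undirected pair appears once in the upper triangle; the diagonal holds self-loops.
n_edges <- sum(adj_sym[upper.tri(adj_sym, diag = TRUE)] > 0)
n_edges # 10: four self-loops plus six pairs

# 0/1 variant: only record whether a tie exists at all.
adj_binary <- (adj_sym > 0) * 1L
```

Under that assumption the A-C tie gets weight 3 and the A-B tie weight 1, which matches the first edge rows shown above.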

Incidence Matrix

An incidence matrix, in contrast, is a cross-table of different elements, e.g. nodes and edges, or different types of nodes as in a bipartite graph.

inc_mat <- matrix(sample(0:3, 12, replace = TRUE), nrow = 3)

rownames(inc_mat) <- LETTERS[1:3]
colnames(inc_mat) <- letters[1:4]

inc_mat
##   a b c d
## A 0 0 3 0
## B 0 0 2 1
## C 3 2 0 0

Converting the incidence matrix to a network results in this:

inc_to_net <- as_tbl_graph(inc_mat, directed = FALSE)
inc_to_net
## # A tbl_graph: 7 nodes and 5 edges
## #
## # An unrooted forest with 2 trees
## #
## # Node Data: 7 x 2 (active)
##   type  name 
##   <lgl> <chr>
## 1 FALSE A    
## 2 FALSE B    
## 3 FALSE C    
## 4 TRUE  a    
## 5 TRUE  b    
## 6 TRUE  c    
## # ... with 1 more row
## #
## # Edge Data: 5 x 3
##    from    to weight
##   <int> <int>  <dbl>
## 1     1     6      3
## 2     2     6      2
## 3     2     7      1
## # ... with 2 more rows

The type variable is a logical that denotes the node-type in a bipartite network.
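
The node and edge counts in that output can be reproduced by hand. The following base R sketch retypes the incidence matrix printed above and rebuilds the edge list, assuming (as the printed from/to values suggest) that column nodes are numbered after the row nodes. The names inc_counts and inc_edges are illustrative only.

```r
# Retype the incidence matrix printed above (values copied from the output).
inc_counts <- matrix(
  c(0, 0, 3, 0,
    0, 0, 2, 1,
    3, 2, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(LETTERS[1:3], letters[1:4])
)

# One node per row element plus one node per column element.
n_nodes <- nrow(inc_counts) + ncol(inc_counts)

# Every non-zero cell becomes one edge from a row node to a column node.
cells <- which(inc_counts > 0, arr.ind = TRUE)
cells <- cells[order(cells[, "row"], cells[, "col"]), , drop = FALSE]

inc_edges <- data.frame(
  from   = unname(cells[, "row"]),
  to     = unname(nrow(inc_counts) + cells[, "col"]), # column nodes come after the 3 row nodes
  weight = inc_counts[cells]
)

n_nodes   # 7
inc_edges # 5 rows; the first is from = 1, to = 6, weight = 3
```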

Edgelist

We already know this format and it’s just a matrix in disguise. It consists of two columns with labels/names of elements that are connected to one another, and sometimes a weight column. It’s actually just the edgelist as in the example network before, yay!

edge_list
## # A tibble: 8 x 3
## # Groups:   from [5]
##    from    to weight
##   <dbl> <dbl>  <int>
## 1     1     2      2
## 2     1     3      1
## 3     2     4      1
## 4     3     2      1
## 5     3     4      1
## 6     3     5      1
## 7     4     5      1
## 8     5     2      3
elist_to_net <- as_tbl_graph(edge_list, directed = FALSE)
elist_to_net
## # A tbl_graph: 5 nodes and 8 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 5 x 1 (active)
##   name 
##   <chr>
## 1 1    
## 2 2    
## 3 3    
## 4 4    
## 5 5    
## #
## # Edge Data: 8 x 3
##    from    to weight
##   <int> <int>  <int>
## 1     1     2      2
## 2     1     3      1
## 3     2     4      1
## # ... with 5 more rows

An edgelist is sufficient to create a network, but often we have some additional data with node attributes like the R and Python users in the example.
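
Why the edgelist is ‘a matrix in disguise’ is easiest to see by going back the other way. A minimal base R sketch, retyping the eight rows of edge_list shown above (the names edges, adj and adj_undirected are mine):

```r
# The eight weighted edges printed above.
edges <- data.frame(
  from   = c(1, 1, 2, 3, 3, 3, 4, 5),
  to     = c(2, 3, 4, 2, 4, 5, 5, 2),
  weight = c(2, 1, 1, 1, 1, 1, 1, 3)
)

n <- 5
ids <- as.character(seq_len(n))
adj <- matrix(0, nrow = n, ncol = n, dimnames = list(ids, ids))

# Directed reading: row = from, column = to, cell = weight.
adj[cbind(edges$from, edges$to)] <- edges$weight

# Undirected reading: mirror the matrix. Adding the transpose is only safe here
# because no pair appears in both directions and there are no self-loops.
adj_undirected <- adj + t(adj)
adj_undirected
```

The cell values sum to 11 in the directed matrix, i.e. the 11 raw ties that were collapsed into weights at the beginning.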

Real-life Application: Co-authorship Networks

For a term paper, I downloaded publication lists of the SOCIUM research centre. After bringing them into a reasonable format - that is an incidence matrix where authors are connected to publications (bipartite network) - I turned everything into an adjacency matrix connecting authors to one another (this is the only matrix algebra I know). I then added some author attributes like department membership, department position, and gender.

soc_inc <- readRDS(here("data", "socpub_bipart.RDS")) # incidence matrix with authors as rows, publications as cols
rownames(soc_inc)[1:3]
## [1] "Aagaard, Lise"        "Abholz, Heinz-Harald" "Abramowski, Ruth"
colnames(soc_inc)[1:3]
## [1] "Social Stratification and Social Movements.; 2020"                          
## [2] "Routledge Handbook of; 2020"                                                
## [3] "Die militaerische Elite des Kaiserreichs. 24 Lebenslaeufe, Darmstadt:; 2020"
soc_adj <- soc_inc %*% t(soc_inc) # inc. matrix x transpose(inc. matrix) = adj. matrix
rownames(soc_adj)[1:3]
## [1] "Aagaard, Lise"        "Abholz, Heinz-Harald" "Abramowski, Ruth"
colnames(soc_adj)[1:3]
## [1] "Aagaard, Lise"        "Abholz, Heinz-Harald" "Abramowski, Ruth"
soc_1mode <- as_tbl_graph(soc_adj, directed = FALSE)
soc_1mode
## # A tbl_graph: 1239 nodes and 5635 edges
## #
## # An undirected multigraph with 45 components
## #
## # Node Data: 1,239 x 1 (active)
##   name                
##   <chr>               
## 1 Aagaard, Lise       
## 2 Abholz, Heinz-Harald
## 3 Abramowski, Ruth    
## 4 Acksel, Britta      
## 5 Adam, Christian     
## 6 Agartan, Tuba I.    
## # ... with 1,233 more rows
## #
## # Edge Data: 5,635 x 3
##    from    to weight
##   <int> <int>  <dbl>
## 1     1     1      3
## 2     1    28      3
## 3     1   141      3
## # ... with 5,632 more rows
#soc_2mode <- as_tbl_graph(soc_inc, directed = FALSE) # this would give the bipartite/two-mode network
#soc_2mode
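
What the multiplication soc_inc %*% t(soc_inc) does is easier to follow on a toy matrix. The sketch below uses three made-up authors and papers (not the SOCIUM data) and assumes a 0/1 incidence matrix:

```r
# Hypothetical 0/1 incidence matrix: rows = authors, columns = papers.
toy_inc <- matrix(
  c(1, 1, 0,
    1, 0, 1,
    0, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Author A", "Author B", "Author C"),
                  c("Paper 1", "Paper 2", "Paper 3"))
)

# One-mode projection: authors x authors.
toy_adj <- toy_inc %*% t(toy_inc)
toy_adj
# Off-diagonal cells count shared papers (A and B share Paper 1, B and C share Paper 3).
# Diagonal cells count each author's own papers.

# Optional: drop the diagonal if self-loops are not wanted in the graph.
toy_adj_noloops <- toy_adj
diag(toy_adj_noloops) <- 0
```

The diagonal is why the first edge of soc_1mode above runs from node 1 to node 1: without zeroing it, every author gets a self-loop weighted by their own publication count.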

# read in author attributes and add to graph
# (read_excel() comes from readxl, the %<>% assignment pipe from magrittr)
auth_att <- as.data.frame(read_excel(here("data", "auth_attributes.xlsx")))
auth_att %<>%
  mutate(socium = if_else(is.na(dep_cat), 0, 1)) # indicator for SOCIUM member

soc_attr <- soc_1mode %>% 
  activate(nodes) %>% 
  left_join(auth_att, by = "name") # the node data only has a name column, so this is the join key

soc_attr %N>% # shortcut for activate(nodes), %E>% for edges
  as_tibble()
## # A tibble: 1,239 x 14
##    name   dep1  dep2  dep3  dep4  dep5  dep6 dep_cat dep_cat2 female    al   agl
##    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>    <dbl>  <dbl> <dbl> <dbl>
##  1 Aaga~    NA    NA    NA    NA    NA    NA      NA       NA     NA    NA    NA
##  2 Abho~    NA    NA    NA    NA    NA    NA      NA       NA     NA    NA    NA
##  3 Abra~    NA    NA     1    NA    NA    NA       3       NA      1    NA    NA
##  4 Acks~    NA    NA    NA    NA    NA    NA      NA       NA     NA    NA    NA
##  5 Adam~    NA    NA    NA    NA    NA    NA      NA       NA     NA    NA    NA
##  6 Agar~    NA    NA    NA    NA    NA    NA      NA       NA     NA    NA    NA
##  7 Ahau~    NA    NA    NA    NA    NA    NA      NA       NA     NA    NA    NA
##  8 Albe~    NA    NA    NA    NA    NA    NA      NA       NA     NA    NA    NA
##  9 Alje~     1    NA    NA    NA    NA    NA       1       NA      0    NA    NA
## 10 Alsc~    NA    NA    NA    NA    NA    NA      NA       NA     NA    NA    NA
## # ... with 1,229 more rows, and 2 more variables: tot_pub <dbl>, socium <dbl>

Centrality Measures

Centrality measures can be calculated to identify the “importance” of an actor within a network. As importance can be defined in various ways, there are also different types of centrality (31 of which are built into tidygraph). I will introduce only the four most widely used measures here: degree, betweenness, closeness, and eigenvector centrality.

Degree Centrality

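Degree centrality simply counts the edges a node has. In directed networks, you can also specify whether you want to look at the in- or out-degree. In the co-authorship network, the most central actor is the one with the most co-author ties. Repeated collaborations are stored as edge weights here, so they only enter the count if you pass the weights to centrality_degree(). The self-loops that stem from the diagonal of soc_adj are counted as well by default (loops = TRUE).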

# add cent. measures to nodes 
soc_meas <- 
  soc_attr %N>% 
  mutate(degree = centrality_degree(),                       
         betweenness = round(centrality_betweenness(weights = NULL), 2),
         eigen = round(centrality_eigen(), 2),
         closeness = centrality_closeness())
## Warning: Problem with `mutate()` input `closeness`.
## i At centrality.c:2784 :closeness centrality is not well-defined for disconnected graphs
## i Input `closeness` is `centrality_closeness()`.
## Warning in closeness(graph = graph, vids = V(graph), mode = mode, weights =
## weights, : At centrality.c:2784 :closeness centrality is not well-defined for
## disconnected graphs
# show name and value of top 10 nodes
soc_meas %N>%              
  arrange(desc(degree)) %>% 
  select(name, degree) %>%
  as_tibble() %>% 
  head(10)
## # A tibble: 10 x 2
##    name              degree
##    <chr>              <dbl>
##  1 Glaeske, Gerd        165
##  2 Rothgang, Heinz      147
##  3 Haunss, Sebastian     84
##  4 Hoffmann, Falk        75
##  5 Mozygemba, Kati       60
##  6 Giersiepen, Klaus     52
##  7 Sommer, Moritz        50
##  8 Gerhardus, Ansgar     48
##  9 Nullmeier, Frank      48
## 10 Czwikla, Jonas        41
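
To make the difference between degree, in-degree and out-degree concrete, here is a minimal base R sketch on the small five-node example network from the beginning (edge weights ignored). It counts by hand rather than calling tidygraph, so the variable names are only illustrative.

```r
# The eight distinct edges of the example network.
from <- c(1, 1, 2, 3, 3, 3, 4, 5)
to   <- c(2, 3, 4, 2, 4, 5, 5, 2)
n_nodes <- 5

# Undirected: every edge counts for both of its endpoints.
degree_undirected <- tabulate(c(from, to), nbins = n_nodes)

# Directed: outgoing ties are counted at "from", incoming ties at "to".
out_degree <- tabulate(from, nbins = n_nodes)
in_degree  <- tabulate(to, nbins = n_nodes)

data.frame(id = 1:n_nodes, degree_undirected, in_degree, out_degree)
# degree_undirected: 2 4 4 3 3
# in_degree:         0 3 1 2 2
# out_degree:        2 1 3 1 1
```

Node 1 is a good illustration: it sends two ties but receives none, which the undirected degree cannot show.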

Betweenness Centrality

Betweenness centrality counts the shortest paths that go through a node. A path is any sequence of adjacent nodes, and it is shorter the fewer edges it contains. If many shortest paths go through a node, it is an important actor for the efficiency of the network. A node with high betweenness centrality can also be interpreted as a bridging actor (a broker) through which other nodes are connected.

soc_meas %N>%
  arrange(desc(betweenness)) %>% 
  select(name, betweenness) %>%
  as_tibble() %>% 
  head(10)
## # A tibble: 10 x 2
##    name               betweenness
##    <chr>                    <dbl>
##  1 Rothgang, Heinz        238506.
##  2 Glaeske, Gerd          155643.
##  3 Nullmeier, Frank       135213.
##  4 Haunss, Sebastian       91706.
##  5 Schneider, Steffen      83009.
##  6 Huinink, Johannes       77915.
##  7 Schimank, Uwe           68006.
##  8 Leibfried, Stephan      65570.
##  9 Buhr, Petra             63849.
## 10 Braun, Bernard          57192.
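
A toy example makes the counting concrete. The sketch below uses tidygraph's create_path() to build an undirected chain 1-2-3-4-5 (my choice for illustration, not part of the SOCIUM data), where every shortest path is unique and can be checked by hand.

```r
library(dplyr)
library(tidygraph)

# Undirected chain of five nodes: 1 - 2 - 3 - 4 - 5
path_bt <- create_path(5) %>%
  mutate(betweenness = centrality_betweenness()) %>%
  as_tibble()

path_bt$betweenness
# 0 3 4 3 0
# Node 3 lies on the shortest paths 1-4, 1-5, 2-4 and 2-5,
# node 2 on 1-3, 1-4 and 1-5, and the end nodes on none.
```

The middle node is the broker here even though nodes 2, 3 and 4 all have the same degree of two.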

Closeness Centrality

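Closeness centrality also builds on shortest paths. It is the inverse of the distance (defined as shortest path length) between a node and all other nodes: the inverse of the average distance if normalized = TRUE, and the inverse of the summed distances by default, which is why the values below are so small.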

soc_meas %N>%
  arrange(desc(closeness)) %>% 
  select(name, closeness) %>%
  as_tibble() %>% 
  head(10)
## # A tibble: 10 x 2
##    name                closeness
##    <chr>                   <dbl>
##  1 Rothgang, Heinz    0.00000502
##  2 Glaeske, Gerd      0.00000502
##  3 Mueller, Rolf      0.00000501
##  4 Nullmeier, Frank   0.00000501
##  5 Schneider, Steffen 0.00000501
##  6 Braun, Bernard     0.00000501
##  7 Giersiepen, Klaus  0.00000501
##  8 Schmid, Achim      0.00000501
##  9 Schmiemann, Guido  0.00000501
## 10 Hoffmann, Falk     0.00000501
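
The formula is simple enough to check by hand. A minimal base R sketch for an undirected chain 1-2-3-4-5 (an illustrative toy graph, not the SOCIUM data), where the shortest-path distance between two nodes is just the difference of their positions:

```r
n <- 5

# Shortest-path distances on the chain 1 - 2 - 3 - 4 - 5.
dist_mat <- abs(outer(1:n, 1:n, "-"))

# Unnormalised closeness: inverse of the summed distances to all other nodes.
closeness_raw <- 1 / rowSums(dist_mat)

# Normalised closeness: inverse of the average distance to the other n - 1 nodes.
closeness_norm <- (n - 1) / rowSums(dist_mat)

round(closeness_raw, 3)  # 0.100 0.143 0.167 0.143 0.100
round(closeness_norm, 3) # 0.400 0.571 0.667 0.571 0.400
```

The ranking is the same in both versions; only the scale changes, and the unnormalised values shrink as the network grows.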

If a network is disconnected (split into multiple components or even isolates), closeness centrality is not well defined, as the warning above says. Therefore, I recompute it for the largest component only:

larg_comp <- 
  soc_meas %N>% 
  mutate(component = group_components() %>% 
           factor()) %>%
  group_by(component) %>%
  mutate(comp_size = n()) %>%  
  ungroup() %>% 
  filter(comp_size == max(comp_size)) %>% 
  mutate(closeness = centrality_closeness()) 
## Warning: The `add` argument of `group_by()` is deprecated as of dplyr 1.0.0.
## Please use the `.add` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
larg_comp %N>%
  arrange(desc(closeness)) %>% 
  select(name, closeness) %>%
  as_tibble() %>% 
  head(10)
## # A tibble: 10 x 2
##    name               closeness
##    <chr>                  <dbl>
##  1 Rothgang, Heinz     0.000299
##  2 Glaeske, Gerd       0.000275
##  3 Mueller, Rolf       0.000273
##  4 Nullmeier, Frank    0.000267
##  5 Schneider, Steffen  0.000266
##  6 Braun, Bernard      0.000265
##  7 Giersiepen, Klaus   0.000258
##  8 Schmid, Achim       0.000252
##  9 Schmiemann, Guido   0.000252
## 10 Hoffmann, Falk      0.000251
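
The same filtering steps can be followed on a toy graph where the result is easy to verify. The sketch below builds a hypothetical six-node network with three components ({a, b, c}, {d, e} and the isolate f) and applies the pipeline from above; all names are made up for illustration.

```r
library(dplyr)
library(tidygraph)

toy <- tbl_graph(
  nodes = data.frame(name = c("a", "b", "c", "d", "e", "f")),
  edges = data.frame(from = c(1L, 2L, 4L), to = c(2L, 3L, 5L)),
  directed = FALSE
)

toy_largest <- toy %>%
  activate(nodes) %>%
  mutate(component = group_components()) %>% # component membership per node
  group_by(component) %>%
  mutate(comp_size = n()) %>%                # number of nodes in each component
  ungroup() %>%
  filter(comp_size == max(comp_size)) %>%    # keep only the largest component
  mutate(closeness = centrality_closeness())

as_tibble(toy_largest)
# keeps a, b and c; b sits in the middle of a - b - c and is the closest node
```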

Eigenvector Centrality

Eigenvector centrality not only looks at how many ties a node (ego) has, but also how many ties its alters have. High eigenvector centrality means being connected to many other well connected nodes.

soc_meas %N>%
  arrange(desc(eigen)) %>% 
  select(name, eigen) %>%
  as_tibble() %>% 
  head(10)
## # A tibble: 10 x 2
##    name                   eigen
##    <chr>                  <dbl>
##  1 Mozygemba, Kati         1   
##  2 Gerhardus, Ansgar       0.96
##  3 Van der Wilt, Gert Jan  0.93
##  4 Brereton, Louise        0.91
##  5 Refolo, Pietro          0.91
##  6 Sacchini, Dario         0.91
##  7 Tummers, Marcia         0.91
##  8 Wahlster, Philip        0.91
##  9 Rehfuess, Eva Annette   0.9 
## 10 Lysdahl, Kristin Bakke  0.89
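
For intuition, eigenvector centrality can be computed by hand as the leading eigenvector of the adjacency matrix. The base R sketch below does this for the small five-node example network from the beginning (undirected, weights ignored) and rescales the scores so that the largest is 1, as in the output above. It is an illustration of the idea, not a claim about how centrality_eigen() is implemented.

```r
# The eight distinct edges of the example network.
edges <- cbind(from = c(1, 1, 2, 3, 3, 3, 4, 5),
               to   = c(2, 3, 4, 2, 4, 5, 5, 2))

adj <- matrix(0, nrow = 5, ncol = 5)
adj[edges] <- 1
adj <- adj + t(adj) # undirected: mirror the matrix

# Leading eigenvector (eigen() sorts eigenvalues of a symmetric matrix in decreasing order).
lead_vec <- abs(eigen(adj)$vectors[, 1])

# Rescale so that the most central node gets a score of 1.
eigen_cent <- lead_vec / max(lead_vec)
round(eigen_cent, 2)
```

Nodes 2 and 3 share the top score: both have four ties and are tied to each other. Node 1 scores lowest, although its two ties go to exactly these two well-connected nodes.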

Static Plotting with ggraph

We can now plot the network, highlighting the different centrality measures. To keep the plots readable, the network is reduced to SOCIUM members.

soc_sub <- 
  soc_meas %>%
  filter(socium == 1)

larg_comp_sub <- 
  larg_comp %>% 
  filter(socium == 1)

ggraph(soc_sub, layout = "fr") +
  geom_edge_link() +
  geom_node_point(aes(fill = as.factor(dep_cat)), 
                  shape = 21, size = 2) +
  labs(fill = "Department") +
  theme_graph(base_family="sans")

ggraph(soc_sub, layout = "fr") +
  geom_edge_link(alpha = .7) +
  geom_node_point(aes(fill = as.factor(dep_cat), size = degree),
                  shape = 21) +
  geom_node_label(aes(filter = degree>=50, label =  name), 
                  vjust = 3, size = 2, alpha = 0.8, label.padding = 0.1, repel = TRUE) +
  labs(fill = "Department", size = "Degree Centrality") +
  theme_graph(base_family="sans")

ggraph(soc_sub, layout = "fr") +
  geom_edge_link(alpha = .7) +
  geom_node_point(aes(fill = as.factor(dep_cat), size = betweenness),
                  shape = 21) +  
  geom_node_label(aes(filter = betweenness>=75000, label =  name), 
                  vjust = 3, size = 2, alpha = 0.8, label.padding = 0.1, repel = TRUE) +
  labs(fill = "Department", size = "Betweenness Centrality") +
  theme_graph(base_family="sans")

ggraph(larg_comp_sub, layout = "fr") + # plot only largest component for closeness 
  geom_edge_link() +
  geom_node_point(aes(fill = as.factor(dep_cat), size = (closeness)^2),
                  shape = 21) +
  geom_node_label(aes(filter = closeness>0.00026, label =  name), 
                  vjust = 3, size = 2, alpha = 0.8, label.padding = 0.1, repel = TRUE) +
  labs(fill = "Department", size = "Closeness Centrality\n(squared)") +
  theme_graph(base_family="sans")

ggraph(soc_sub, layout = "fr") +
  geom_edge_link(alpha = .7) +
  geom_node_point(aes(fill = as.factor(dep_cat), size = eigen),
                  shape = 21) +
  geom_node_label(aes(filter = eigen>=0.1, label =  name), 
                 vjust = 3, size = 2, alpha = 0.8, label.padding = 0.1, repel = TRUE) +
  labs(fill = "Department", size = "Eigenvector Centrality") +
  theme_graph(base_family="sans")

Interactive Plotting with visNetwork

node_df <- soc_sub %N>% 
  mutate(id = row_number()) %>% # running node id instead of a hard-coded 1:240
  rename(label = name, group = dep_cat, size = degree) %>% 
  arrange(group) %>% 
  as.data.frame() 

edge_df <- soc_sub %E>% 
  filter(!edge_is_loop()) %>% # remove loops
  as.data.frame() 

visNetwork(node_df, edge_df, width = "100%", height = "600px") %>% 
  visNodes(shadow = TRUE, font = list(size = 30)) %>% 
  visGroups(groupname = as.character(1), color = pal[1]) %>% # have to specify groups individually
  visGroups(groupname = as.character(2), color = pal[3]) %>% 
  visGroups(groupname = as.character(3), color = pal[4]) %>% 
  visGroups(groupname = as.character(4), color = pal[5]) %>% 
  visGroups(groupname = as.character(5), color = pal[6]) %>% 
  visIgraphLayout(layout = "layout_with_fr") %>% 
  visLegend(main="Department", position="right", ncol=1) %>% 
  visOptions(selectedBy = "group")

Resources